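# Create a tbl_graph from an edge table only
as_tbl_graph(edges)
# Create a tbl_graph from separate node and edge tables
tbl_graph(nodes, edges)
# Add the transfer year to the edge table
graph |>
  activate(edges) |>
  mutate(year = lubridate::year(casedate))
Visual methods for exploring multivariate spatio-temporal networks with application to health transport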
Confirmation Report
Background
Analysing spatio-temporal network data is a contemporary research problem that has gained increasing interest in the health field, particularly within emergency medical services (EMS) and ambulance transfer systems. Such data capture spatial, temporal, and often multivariate information. The spatial component generally represents geographic locations or spatial geometries, while the temporal component records time-related information through timestamps or time intervals (Rao, Govardhan, and Rao 2012). In addition, the underlying network structure creates connections and multivariate dependencies between locations and transfers. While techniques exist to analyse spatial and temporal components separately, performing analysis, and perhaps more importantly, exploring these components in conjunction with the network structure, remains an open challenge.
Older individuals often require continuous support, including 24-hour care, assistance with daily tasks, and ongoing medical supervision. Thus, many reside in residential aged care facilities (RACFs), which are specifically designed to provide this comprehensive care (Kearney and Winterbottom 2006). RACFs frequently rely on ambulance services to transfer residents to hospital for both acute emergencies and planned or scheduled medical appointments. The number of these transfers is rising, partly due to population ageing (Harris and Sharma 2018), which puts considerable pressure on emergency medical services, where delays could increase health risks (Harmsen et al. 2015). During the COVID-19 pandemic, lockdown measures and movement restrictions further disrupted the delivery of emergency services. The effects of lockdowns and rising transfer demand highlight the need for further analysis to improve the planning and utilisation of ambulance services.
To gain insight into transfer patterns, network representations linking RACFs and hospitals provide a powerful framework for data exploration. However, most network research focuses primarily on topological properties, often treating the network homogeneously and overlooking other important information, such as the association between variables (Cardenas et al. 2021; Fernández-Gracia et al. 2017). While a network representation is suited to transfer data, overemphasising network topology can neglect the fundamental principles of data exploration. These limitations arise from the practical challenges of working with spatio-temporal network data: cleaning the data (particularly its temporal information), wrangling and subsetting it with ease, and visualising and drawing inferences from it. As a result, simple informative analyses, such as examining variable distributions, temporal trends, or bivariate relationships, are often underutilised, despite their ability to reveal key insights into the data. This underlines the need for an infrastructure that integrates network-based approaches with exploratory data analysis (EDA), enabling a comprehensive exploration of spatio-temporal transfer networks.
Studying how infectious diseases spread through the network of transfers between RACFs and hospitals is important because older people tend to face a higher risk of mortality during outbreaks (Parohan et al. 2020). Patient transfers between facilities create pathways for a disease to be transmitted across the system, leading to rapid spread. Traditional compartmental infectious disease models, which assume a homogeneous or static structure, do not adequately capture networks that change over time. In reality, ambulance transfers are highly dynamic: the connections between facilities can change in response to demand, constraints, and even outbreak conditions. Understanding these transmission dynamics is therefore crucial for devising effective policies to limit spread and for identifying high-risk facilities, critical transfer connections, and periods of exposure.
Project 1: Developing Infrastructure for Exploratory Analysis of Multivariate Spatio-temporal Network with Application to Ambulance Transfers
Part A: Exploratory Data Analysis Infrastructure for Multivariate Spatio-temporal Network
As multivariate spatio-temporal network data become more accessible and complex, understanding their structure and dynamics is key to effective decision-making. As mentioned in Section 1, a major challenge in analysing large multivariate networks lies in the sheer amount of information they contain, most of which is often overlooked. The proposed infrastructure aims to support the exploration of multivariate spatio-temporal network data. Exploratory data analysis involves several key processes: data storage, cleaning, subsetting, and visualisation. The following sections therefore review existing tools that support these processes and discuss their limitations.
Data Storage and Cleaning
Data cleaning is the first stage of a reliable analysis. Spatio-temporal data usually need to be checked for inconsistent temporal records, duplicated records, and spatial inaccuracies. Adding a network structure on top of this, with nodes, edges, and their attributes, requires the network topology to be preserved throughout the process. Typically, this stage involves tools such as dplyr (Wickham et al. 2023) for manipulating the data, tsibble (Wang, Cook, and Hyndman 2020) for validating temporal consistency, sf (Pebesma 2018) for checking coordinate accuracy, and igraph/network (Csárdi et al. 2026; Butts 2008) for maintaining the network structure.
The tidygraph (Pedersen 2024b) package provides a tidy API for graph and network manipulation, in which network data is treated as two tidy tables, one for node data and one for edge data. In tidy data (Wickham 2014), each variable has its own column, each observation has its own row, and each value has its own cell. These tables are stored together within a tbl_graph object, which preserves the underlying network topology while allowing standard dplyr verbs to be applied. Interaction with the node and edge tables is handled by a special function, activate(), which lets the user switch between the two tables and apply dplyr operations such as mutate(), group_by(), and joins.
There are two main functions for creating a tbl_graph object: as_tbl_graph() and tbl_graph(). The first, as_tbl_graph(), accepts objects of several classes, such as data.frame, igraph, and network, and converts them into a tbl_graph object. The second, tbl_graph(), takes two data.frame objects, one for nodes and one for edges.
The difference between the two is that as_tbl_graph() needs only the edge dataset, which means that all multivariate information sits in the edge data, while the node data contains only the name (location). With tbl_graph(), the node table is supplied explicitly, which is useful when the nodes carry their own attributes.
# A tbl_graph: 815 nodes and 102073 edges
#
# A directed acyclic multigraph with 6 components
#
# Edge Data: 102,073 × 9 (active)
from to casedate age gender diagnosis daytype single_id year
<int> <int> <date> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 575 659 2019-04-29 96 Female OTHER - SPECIFY weekda… 10097915 2019
2 514 628 2022-05-31 88 Female SHORT OF BREATH weekda… 13485226 2022
3 522 628 2020-10-03 90 Male LACERATION weeken… 11574122 2020
4 562 633 2020-12-18 91 Female PAIN weekda… 11813591 2020
5 562 640 2018-02-05 89 Female SEPSIS weekda… 8895777 2018
6 562 633 2021-03-31 92 Female PAIN weekda… 12134872 2021
7 562 640 2021-04-22 92 Female SHORT OF BREATH weekda… 12204603 2021
8 562 633 2019-07-20 97 Female URINARY TRACT IN… weeken… 10340046 2019
9 516 706 2021-01-13 98 Female NO PROBLEM IDENT… weekda… 11895939 2021
10 596 640 2020-07-04 90 Male ALTERED CONSCIOU… weeken… 11319542 2020
# ℹ 102,063 more rows
#
# Node Data: 815 × 4
name longitude latitude type
<chr> <dbl> <dbl> <chr>
1 1 ABERDEEN STREET RESERVOIR 145. -37.7 racf
2 1 ADENEY STREET CAMPERDOWN 143. -38.2 racf
3 1 AITKEN AVENUE DONALD 143. -36.4 racf
# ℹ 812 more rows
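The difference between the two constructors described above can be sketched on a toy dataset. The facility names, dates, and the type column below are hypothetical and only stand in for the real transfer data; the sketch assumes the dplyr and tidygraph packages are installed.

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: two RACFs transferring to one hospital
edges <- tibble(
  from     = c("RACF A", "RACF A", "RACF B"),
  to       = c("Hospital X", "Hospital X", "Hospital X"),
  casedate = as.Date(c("2019-04-29", "2020-10-03", "2021-03-31"))
)
nodes <- tibble(
  name = c("RACF A", "RACF B", "Hospital X"),
  type = c("racf", "racf", "hospital")
)

# Edge table only: the node table is derived and holds just the name
g_edges_only <- as_tbl_graph(edges, directed = TRUE)

# Explicit node table: node attributes such as type are kept
g_with_nodes <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

# Compare the columns available on the node table in each case
node_cols_edges_only <- g_edges_only |> activate(nodes) |> as_tibble() |> names()
node_cols_with_nodes <- g_with_nodes |> activate(nodes) |> as_tibble() |> names()

# Switch to the edge table and derive the transfer year
g_with_nodes <- g_with_nodes |>
  activate(edges) |>
  mutate(year = as.integer(format(casedate, "%Y")))
edge_years <- g_with_nodes |> activate(edges) |> as_tibble() |> pull(year)
```

In this sketch the node table built by as_tbl_graph() has only a name column, whereas the one built by tbl_graph() also carries type, which mirrors the distinction drawn in the text.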
For spatial networks, the sfnetworks package (van der Meer et al. 2024) extends tidygraph by allowing spatial geometries to be incorporated directly within the tbl_graph object. It is useful for dealing with complex geometry where edges are not straight-line connections, such as road or transport networks. The package also allows standard spatial operations from the sf package to be performed within the network context.
However, the temporal data structure provided by tsibble is not directly compatible with tidygraph objects. As a result, validating temporal consistency requires converting the data back to a tsibble object or performing the temporal check before the tbl_graph is created. This introduces an important limitation: common operations such as filling missing observations are done outside the network context and therefore do not preserve the network topology. For example, if a node is missing in January 2020, how should the edges associated with that node be imputed? One option is to assume that no edges exist during that period, which is reasonable in some cases but not in all. This highlights a key challenge in cleaning spatio-temporal network data: temporal consistency and network structure should be considered jointly. Careful methodological decisions are required to ensure that both the temporal attributes and the relational structure of the network remain coherent throughout the cleaning process.
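A minimal sketch of such a temporal check, performed on the edge table before the tbl_graph is created, is given below. The toy data, the monthly aggregation, and the choice to fill gaps with zero transfers are all assumptions made for illustration; as noted above, zero-filling is only appropriate in some cases.

```r
library(dplyr)
library(tsibble)

# Hypothetical edge table: facility B has no transfers in February 2020
edges <- tibble(
  from     = c("A", "A", "A", "B", "B"),
  to       = "H",
  casedate = as.Date(c("2020-01-05", "2020-02-10", "2020-03-02",
                       "2020-01-20", "2020-03-15"))
)

# Aggregate to monthly transfer counts per origin and convert to a tsibble
monthly <- edges |>
  mutate(month = yearmonth(casedate)) |>
  count(from, month, name = "n_transfers") |>
  as_tsibble(key = from, index = month)

# Flag which origins have implicit gaps in their monthly series
gap_flags <- has_gaps(monthly)

# One possible treatment: make the gaps explicit with zero transfers
filled <- fill_gaps(monthly, n_transfers = 0L)
```

The check happens entirely on a tibble, so any rows added by fill_gaps() have no counterpart in the network; deciding how (or whether) to reflect them as edges is left to the analyst.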
Data Subsetting
Data subsetting extracts a subset of spatio-temporal network data based on spatial, temporal, and multivariate variables. This includes grouping data by time periods or regions, as well as filtering based on variable values and network characteristics (e.g., in-degree). In a network context, filtering operations need to account for topological dependencies between nodes and edges. When nodes are removed based on a condition, all edges incident to those nodes are also deleted (Figure 1). In contrast, when edges are removed, the nodes connected to those edges are preserved, since nodes can exist independently of any edge (Figure 2). The tidygraph package supports these subsetting operations through dplyr functions such as filter() and select(), which are applied separately to nodes and edges while maintaining the integrity of the underlying network. As with data manipulation, users need to switch between the node and edge tables to subset on their respective attributes.
graph |>
  activate(edges) |>
  filter(between(year, 2020, 2021))
# A tbl_graph: 815 nodes and 48606 edges
#
# A directed acyclic multigraph with 59 components
#
# Edge Data: 48,606 × 9 (active)
from to casedate age gender diagnosis daytype single_id year
<int> <int> <date> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 522 628 2020-10-03 90 Male LACERATION weeken… 11574122 2020
2 562 633 2020-12-18 91 Female PAIN weekda… 11813591 2020
3 562 633 2021-03-31 92 Female PAIN weekda… 12134872 2021
4 562 640 2021-04-22 92 Female SHORT OF BREATH weekda… 12204603 2021
5 516 706 2021-01-13 98 Female NO PROBLEM IDENT… weekda… 11895939 2021
6 596 640 2020-07-04 90 Male ALTERED CONSCIOU… weeken… 11319542 2020
7 279 769 2020-05-16 19 Male POST ICTAL weeken… 11183777 2020
8 231 645 2020-08-06 46 Male OTHER - SPECIFY weekda… 11422186 2020
9 231 645 2020-08-07 46 Male OTHER - SPECIFY weekda… 11415518 2020
10 265 657 2021-05-05 45 Male NO PROBLEM IDENT… weekda… 12244819 2021
# ℹ 48,596 more rows
#
# Node Data: 815 × 4
name longitude latitude type
<chr> <dbl> <dbl> <chr>
1 1 ABERDEEN STREET RESERVOIR 145. -37.7 racf
2 1 ADENEY STREET CAMPERDOWN 143. -38.2 racf
3 1 AITKEN AVENUE DONALD 143. -36.4 racf
# ℹ 812 more rows
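The asymmetry between node and edge filtering described above can be checked on a small hypothetical network. The sketch below assumes dplyr and tidygraph are installed; the node names and years are made up for illustration.

```r
library(dplyr)
library(tidygraph)

# Hypothetical network: two RACFs, one hospital, three transfers
nodes <- tibble(name = c("RACF A", "RACF B", "Hospital X"))
edges <- tibble(
  from = c("RACF A", "RACF A", "RACF B"),
  to   = c("Hospital X", "Hospital X", "Hospital X"),
  year = c(2019, 2020, 2021)
)
graph <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

# Removing a node also removes every edge incident to it
nodes_removed <- graph |>
  activate(nodes) |>
  filter(name != "RACF A")

# Removing edges leaves all nodes in place, even if they become isolated
edges_removed <- graph |>
  activate(edges) |>
  filter(year >= 2021)

# Node and edge counts after each operation
counts_nodes_removed <- c(nodes = igraph::vcount(nodes_removed),
                          edges = igraph::ecount(nodes_removed))
counts_edges_removed <- c(nodes = igraph::vcount(edges_removed),
                          edges = igraph::ecount(edges_removed))
```

Dropping "RACF A" leaves two nodes and one edge, while keeping only the 2021 edge leaves all three nodes and one edge, so "RACF A" remains in the graph as an isolated node.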
Network Sampling
Another important aspect of subsetting is understanding how sampling methods perform on network data. Real-world datasets are often not evenly distributed across dimensions such as time, space, or variable groups. Some strata may contain more observations than others, and analysing the data as a whole can distort the interpretation, as the larger strata may dominate the observed patterns. Sampling provides a way to subset the data while keeping it representative of the population. Stratified sampling, in particular, helps with imbalanced data by dividing the data into subgroups and sampling within each group, ensuring that all groups are represented in the sample.
In the network context, sampling methods are generally categorised into the following (Chuong Nguyen 2025):
Node-based sampling selects a subset of nodes from the network and retains edges that are incident to the sampled nodes. This method is efficient and is usually implemented in large-scale studies (Ben-Eliezer et al. 2022). However, it often fails to capture important global structural properties such as connectivity and clustering.
Edge-based sampling samples a subset of edges directly and includes the nodes incident to those edges. This method is better at preserving structural patterns (Jiao 2024). However, it tends to favour nodes with higher degrees, which can bias the sample.
Many other sampling methods exist. Hu and Lau (2013) provide a comprehensive survey and taxonomy of graph sampling approaches; these additional methods are outside the scope of this project.
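The two sampling schemes above can be sketched on a small hypothetical network. This is an illustration under stated assumptions rather than a reference implementation: the node names and the id column are made up, and node-based sampling is taken here to mean keeping only the edges between sampled nodes.

```r
library(dplyr)
library(tidygraph)

set.seed(1)

# Hypothetical network: four RACFs and two hospitals
nodes <- tibble(name = c("A", "B", "C", "D", "H1", "H2"))
edges <- tibble(
  from = c("A", "A", "B", "C", "D", "D"),
  to   = c("H1", "H2", "H1", "H1", "H2", "H2"),
  id   = 1:6
)
graph <- tbl_graph(nodes = nodes, edges = edges, directed = TRUE)

# Node-based sampling: sample nodes, edges to unsampled nodes are dropped
kept_nodes <- sample(nodes$name, size = 4)
node_sample <- graph |>
  activate(nodes) |>
  filter(name %in% kept_nodes)

# Edge-based sampling: sample edges, then drop nodes left without edges
kept_edges <- sample(edges$id, size = 3)
edge_sample <- graph |>
  activate(edges) |>
  filter(id %in% kept_edges) |>
  activate(nodes) |>
  filter(!node_is_isolated())
```

The node sample always has four nodes but an unpredictable number of edges, whereas the edge sample always has three edges and only the nodes they touch, which is one way to see why the two schemes preserve different properties of the network.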
The tidygraph package supports sampling from a tbl_graph object through the sample_n() function, although slice_sample() is now the recommended replacement. One limitation of tbl_graph is that it does not directly support stratified (i.e., group_by) sampling. Instead, the node or edge table needs to be extracted as a tibble (Müller and Wickham 2025), stratified sampling performed on that table, and the original network then filtered to the sampled nodes or edges. This limitation shows that sampling operations for network objects can still be improved.
set.seed(1)
# Edges sampling
graph |>
  activate(edges) |>
  sample_n(size = 20)
# Stratified edges sampling
edges_kept <- graph |>
  activate(edges) |>
  as_tibble() |>
  group_by(daytype) |>
  sample_n(size = 10) |>
  pull(single_id)
graph |>
  activate(edges) |>
  filter(single_id %in% edges_kept) |>
  activate(nodes) |>
  filter(!node_is_isolated())
# A tbl_graph: 33 nodes and 20 edges
#
# A rooted forest with 13 trees
#
# Node Data: 33 × 4 (active)
name longitude latitude type
<chr> <dbl> <dbl> <chr>
1 124 MACULATA DRIVE SHEPPARTON 145. -36.4 racf
2 130 DIMBOOLA ROAD WESTMEADOWS 145. -37.7 racf
3 133 CAIRNLEA DRIVE CAIRNLEA 145. -37.8 racf
4 15 COULSTOCK STREET EPPING 145. -37.7 racf
5 161 MALE STREET BRIGHTON 145. -37.9 racf
6 17 AMAROO WAY NEWBOROUGH 146. -38.2 racf
7 17 DERWENT STREET RINGWOOD 145. -37.8 racf
8 17 PARK DRIVE SUNSHINE NORTH 145. -37.8 racf
9 18 VILLA ROAD SPRINGVALE 145. -37.9 racf
10 33 FRANK STREET NOBLE PARK 145. -38.0 racf
# ℹ 23 more rows
#
# Edge Data: 20 × 9
from to casedate age gender diagnosis daytype single_id year
<int> <int> <date> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 18 29 2018-01-22 88 Female CHEST INFECTION weekda… 8857999 2018
2 13 29 2021-08-25 88 Male CHEST INFECTION weekda… 12593099 2021
3 16 21 2018-10-14 69 Male NO PROBLEM IDENTI… weeken… 9541818 2018
# ℹ 17 more rows
As discussed in Section 2.1.2, nodes in a network can exist without incident edges. Thus, edge-based sampling does not automatically remove nodes that become isolated after sampling. These nodes must be removed explicitly by filtering the node table with the node_is_isolated() function.
Data Visualisation
Data visualisation helps reveal patterns, anomalies, and relationships that may not be apparent from numerical summaries alone. Network data is often viewed as connections or flows between nodes or locations, and network-based visualisation makes it easier to communicate with a broader audience. For a simple network without spatial coordinates, placing nodes and edges in a visualisation requires a graph layout algorithm, such as the Kamada-Kawai layout (Kamada and Kawai 1989). Depending on the chosen algorithm, the positions of nodes and edges can differ even for the same network dataset. With spatial information, visualisation becomes more straightforward, as longitude and latitude can be used to specify the actual locations of the nodes, with edges represented as lines connecting these locations.
simple_graph |>
  ggraph(x = long, y = lat) +
  geom_sf(data = vic_map, color = "white") +
  geom_edge_link(alpha = 0.1) +
  geom_node_point(aes(color = category))
Visualising high-dimensional network data can be challenging, especially through a static visualisation alone. The current tool for network visualisation in R is the ggraph package (Pedersen 2024a), which extends the ggplot2 package (Wickham 2016) to support relational data structures such as networks, graphs, and trees. The ggraph package is effective at visualising static networks, offering a range of layout algorithms for positioning nodes while keeping the familiar ggplot2 syntax. However, its support for interactive network visualisation is currently limited. Static network visualisation is difficult because only a limited amount of information can be mapped onto a single figure. As shown in Figure 3, even a simple network representation can quickly become cluttered. Answering detailed questions, such as the number of transfers between a specific RACF and hospital or the name of a particular RACF, is difficult using static visualisation alone. Interactive visualisation helps address these limitations by layering additional information onto the visualisation, allowing for further exploration.
interactive_vis_node <- simple_graph |>
  mutate(name = str_remove(name, "'")) |>
  ggraph(x = long, y = lat) +
  geom_sf(data = vic_map, color = "white") +
  geom_edge_link(alpha = 0.1) +
  geom_point_interactive(aes(x = x,
                             y = y,
                             color = category,
                             tooltip = name,
                             data_id = name))
girafe(ggobj = interactive_vis_node,
       options = list(
         opts_hover(css = "fill:lightblue;stroke:grey;stroke-width:0.5px"),
         opts_zoom(min = 0.5, max = 3)
       ))